Abstract

When applying machine learning to problems in NLP, there are many choices to make about how to represent input texts. These choices can have a big effect on performance, but they are often uninteresting to researchers or practitioners who simply need a module that performs well. We propose an approach to optimizing over this space of choices, formulating the problem as global optimization. We apply a sequential model-based optimization technique and show that our method makes standard linear models competitive with more sophisticated, expensive state-of-the-art methods based on latent variable models or neural networks on various topic classification and sentiment analysis problems.

Our approach is a first step towards black-box NLP systems that work with raw text and do not require manual tuning.

1 Introduction 

NLP researchers and practitioners spend a considerable amount of time comparing machine-learned models of text that differ in relatively uninteresting ways. For example, in categorizing texts, should the “bag of words” include bigrams, and is tf-idf weighting a good idea? These choices matter experimentally, often leading to big differences in performance, with little consistency across tasks and datasets in which combination of choices works best. Unfortunately, these differences tell us little about language or the problems that machine learners are supposed to solve.

We propose that these decisions can be automated in a similar way to hyperparameter selection (e.g., choosing the strength of a ridge or lasso regularizer). Given a particular text dataset and classification task, we introduce a technique for optimizing over the space of representational choices, along with other “nuisances” that interact with these decisions, like hyperparameter selection.¹ For example, using higher-order n-grams means more features and a need for stronger regularization and more training iterations. Generally, these decisions about instance representation are made by humans, heuristically; our work is the first to automate them.

Our technique instantiates sequential model-based optimization (SMBO; Hutter et al., 2011). SMBO and other Bayesian optimization approaches have been shown to work well for hyperparameter tuning (Bergstra et al., 2011; Hoffman et al., 2011; Snoek et al., 2012). Though popular in computer vision (Bergstra et al., 2013), these techniques have received little attention in NLP.

We apply the technique to logistic regression on a range of topic and sentiment classification tasks. Consistently, our method finds representational choices that perform better than linear baselines previously reported in the literature, and that, in some cases, are competitive with more sophisticated non-linear models trained using neural networks.


2 Problem Formulation and Notation 

Let the training data consist of a collection of pairs $d_{train} = \langle \langle d.i_1, d.o_1 \rangle, \ldots, \langle d.i_n, d.o_n \rangle \rangle$, where each input $d.i \in \mathcal{I}$ is a text document and each output $d.o \in \mathcal{O}$, the output space. The overall training goal is to maximize a performance function $f$ (e.g., classification accuracy, log-likelihood, $F_1$ score, etc.) of a machine-learned model, on a held-out dataset, $d_{dev} \in (\mathcal{I} \times \mathcal{O})^{n'}$.

Classification proceeds in three steps: first, $\mathbf{x} : \mathcal{I} \to \mathbb{R}^N$ maps each input to a vector representation. Second, a classifier is learned from the inputs (now transformed into vectors) and outputs: $L : (\mathbb{R}^N \times \mathcal{O})^n \to (\mathbb{R}^N \to \mathcal{O})$. Finally, the resulting classifier $c : \mathcal{I} \to \mathcal{O}$ is fixed as $L(d_{train}) \circ \mathbf{x}$ (i.e., the composition of the representation function with the learned classifier).

¹In §5, we argue that the technique is also applicable in unsupervised settings.

Here we consider linear classifiers of the form

$$c(d.i) = \arg\max_{o \in \mathcal{O}} \mathbf{w}_o^\top \mathbf{x}(d.i) \qquad (1)$$

where the coefficients $\mathbf{w}_o \in \mathbb{R}^N$, for each output $o$, are learned using logistic regression on the training data. We let $\mathbf{w}$ denote the concatenation of all $\mathbf{w}_o$. Hence the parameters can be understood as a function of the training data and the representation function $\mathbf{x}$. The performance function $f$, in turn, is a function of the held-out data and $\mathbf{x}$ (and also of $\mathbf{w}$ and $d_{train}$, through $\mathbf{x}$). For simplicity, we will write “$f(\mathbf{x})$” when the rest are clear from context.
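The decision rule in Eq. 1 can be sketched in a few lines of Python. This is only an illustration: the two output labels, the weight vectors, and the three-dimensional representation are made-up values, not learned parameters from the experiments.

```python
from typing import Dict, List


def dot(w: List[float], x: List[float]) -> float:
    """Inner product w^T x of two equal-length vectors."""
    return sum(wi * xi for wi, xi in zip(w, x))


def classify(weights: Dict[str, List[float]], x_vec: List[float]) -> str:
    """Return the output o that maximizes w_o^T x, as in Eq. 1."""
    return max(weights, key=lambda o: dot(weights[o], x_vec))


# Hypothetical weights for two outputs over a 3-dimensional representation.
weights = {"pos": [1.0, -0.5, 0.0], "neg": [-1.0, 0.5, 0.2]}
x_vec = [1.0, 0.0, 1.0]  # stands in for x(d.i) of one document
label = classify(weights, x_vec)
```

Changing the representation function changes `x_vec` (and its dimensionality), which is why the learned weights depend on the choice of representation.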

Typically, $\mathbf{x}$ is fixed by the model designer, perhaps after some experimentation, and learning focuses on selecting the parameters $\mathbf{w}$. For logistic regression and many other linear models, this training step reduces to convex optimization in $N|\mathcal{O}|$ dimensions, a solvable problem that is still costly for large datasets and/or large output spaces. In seeking to maximize $f$ with respect to $\mathbf{x}$, we do not wish to carry out training any more times than necessary.

Choosing x can be understood as a problem of 
selecting hyperparameter values. We therefore turn 
to Bayesian optimization, a family of techniques 
recently introduced for selecting hyperparameter 
values intelligently when solving for parameters 
(w) is costly. 

3 Bayesian Optimization 

Our approach is based on sequential model-based optimization (SMBO; Hutter et al., 2011). It iteratively chooses representation functions $\mathbf{x}$. On each round, it makes this choice through a nonparametrically-estimated probabilistic model of $f$, then evaluates $f$; we call this a “trial.” As in any iterative search algorithm, the goal is to balance exploration of options for $\mathbf{x}$ with exploitation of previously-explored options, so that a good choice is found in a small number of trials. See Algorithm 1.

More concretely, in the $t$th trial, first, $\mathbf{x}_t$ is selected using an acquisition function $A$ and a “surrogate” probabilistic model $p_t$. Second, $f$ is evaluated given $\mathbf{x}_t$, an expensive operation which involves training to select parameters $\mathbf{w}$ and assessing performance on the held-out data. Third, the probabilistic model is updated using a nonparametric estimator.


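Algorithm 1 SMBO algorithm

Input: number of trials $T$, target function $f$
$p_1$ = initial surrogate model
Initialize $y^*$
for $t = 1$ to $T$ do
    $\mathbf{x}_t \leftarrow \arg\max_{\mathbf{x}} A(\mathbf{x}; p_t, y^*)$
    $y_t \leftarrow$ evaluate $f(\mathbf{x}_t)$
    Update $y^*$
    Estimate $p_t$ given $\mathbf{x}_{1:t}$ and $y_{1:t}$
end for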
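The control flow of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `toy_propose` is a hypothetical stand-in for maximizing the acquisition function under a surrogate model, the search space is a single number in [0, 1], and `toy_objective` replaces the expensive train-and-evaluate step.

```python
import random
from typing import Callable, List, Tuple

Trial = Tuple[float, float]  # (x_t, y_t)


def toy_propose(history: List[Trial], rng: random.Random) -> float:
    """Hypothetical stand-in for argmax_x A(x; p_t, y*): explore uniformly
    half of the time, otherwise perturb the best point seen so far."""
    if not history or rng.random() < 0.5:
        return rng.uniform(0.0, 1.0)
    best_x = max(history, key=lambda trial: trial[1])[0]
    return min(1.0, max(0.0, best_x + rng.gauss(0.0, 0.1)))


def smbo(f: Callable[[float], float],
         propose: Callable[[List[Trial], random.Random], float],
         n_trials: int,
         seed: int = 0) -> Tuple[float, float, List[Trial]]:
    """Run n_trials rounds of propose -> evaluate -> record."""
    rng = random.Random(seed)
    history: List[Trial] = []
    for _ in range(n_trials):
        x_t = propose(history, rng)   # choose the next point to try
        y_t = f(x_t)                  # the expensive trial
        history.append((x_t, y_t))    # data for re-estimating the surrogate
    best_x, best_y = max(history, key=lambda trial: trial[1])
    return best_x, best_y, history


def toy_objective(x: float) -> float:
    """Toy performance function with its maximum at x = 0.3."""
    return -(x - 0.3) ** 2


best_x, best_y, history = smbo(toy_objective, toy_propose, n_trials=30)
```

Keeping the proposal step behind a function argument mirrors the black-box view in the text: the loop itself does not need to know how the surrogate model or the acquisition function is defined.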


We next describe the acquisition function $A$ and the surrogate model $p_t$ used in our experiments.

3.1 Acquisition Function 

A good acquisition function returns high values for $\mathbf{x}$ when either the value $f(\mathbf{x})$ is predicted to be high or the uncertainty about $f(\mathbf{x})$'s value is high; balancing between these is the classic tradeoff between exploitation and exploration. We use a criterion called Expected Improvement (EI; Jones, 2001), which is the expectation (under the current surrogate model $p_t$) of the amount by which the value $y$ will exceed $y^*$:

$$A(\mathbf{x}; p_t, y^*) = \int_{-\infty}^{\infty} \max(y - y^*, 0)\, p_t(y \mid \mathbf{x})\, dy$$


where $y^*$ is chosen depending on the surrogate model, discussed below. (For now, think of it as a strongly-performing “benchmark” value of $f$, discovered in earlier iterations.) Other options for the acquisition function include maximum probability of improvement (Jones, 2001), minimum conditional entropy (Villemonteix et al., 2006), Gaussian process upper confidence bound (Srinivas et al., 2010), or a combination of them (Hoffman et al., 2011). We selected EI because it is the most widely used acquisition function that has been shown to work well on a range of tasks.
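To make the Expected Improvement integral concrete, the sketch below evaluates it for a Gaussian predictive distribution, for which a closed form is known. The Gaussian form is an assumption made purely for illustration (the surrogate used in this work is described in the next section), and the means, standard deviations, and benchmark value are invented numbers.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(mu: float, sigma: float, y_star: float) -> float:
    """Closed form of E[max(y - y*, 0)] when y ~ N(mu, sigma^2)."""
    z = (mu - y_star) / sigma
    return (mu - y_star) * normal_cdf(z) + sigma * normal_pdf(z)


def expected_improvement_numeric(mu: float, sigma: float, y_star: float,
                                 steps: int = 20000) -> float:
    """Midpoint-rule approximation of the same integral, as a check.
    The integrand is zero below y*, so integration starts there."""
    lo = y_star
    hi = max(mu, y_star) + 10.0 * sigma
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * width
        density = normal_pdf((y - mu) / sigma) / sigma
        total += (y - y_star) * density * width
    return total


# Same predicted mean, different uncertainty (made-up values).
confident = expected_improvement(mu=0.80, sigma=0.01, y_star=0.82)
uncertain = expected_improvement(mu=0.80, sigma=0.05, y_star=0.82)
```

With the predicted mean held fixed below the benchmark, the more uncertain candidate receives the larger score, which is the exploration side of the tradeoff described above.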


3.2 Surrogate Model 

As a surrogate model, we use a tree-structured Parzen estimator (TPE; Bergstra et al., 2011). This is a nonparametric approach to density estimation. We seek to estimate $p_t(y \mid \mathbf{x})$ where $y = f(\mathbf{x})$, the performance function that is expensive to compute exactly. The TPE approach is as follows:

$$p_t(y \mid \mathbf{x}) \propto p_t(y) \cdot p_t(\mathbf{x} \mid y)$$

$$p_t(\mathbf{x} \mid y) = \begin{cases} p_t^{<}(\mathbf{x}), & \text{if } y < y^* \\ p_t^{\geq}(\mathbf{x}), & \text{if } y \geq y^* \end{cases}$$

where $p_t^{<}$ and $p_t^{\geq}$ are densities estimated using observations from previous trials that are less than and greater than or equal to $y^*$, respectively. In TPE, $y^*$ is defined as some quantile of the observed $y$; we use 15-quantiles.

As shown by Bergstra et al. (2011), the Expected Improvement in TPE can be written as:

$$A(\mathbf{x}; p_t, y^*) \propto \left( \gamma + \frac{p_t^{<}(\mathbf{x})}{p_t^{\geq}(\mathbf{x})} (1 - \gamma) \right)^{-1} \qquad (2)$$

where $\gamma = p_t(y < y^*)$, fixed at 0.15 by definition of $y^*$ (above). Here, we prefer $\mathbf{x}$ with high probability under $p_t^{\geq}(\mathbf{x})$ and low probability under $p_t^{<}(\mathbf{x})$. To maximize this quantity, we draw many candidates according to $p_t^{\geq}(\mathbf{x})$ and evaluate them according to $p_t^{<}(\mathbf{x}) / p_t^{\geq}(\mathbf{x})$. Note that $p(y)$ does not need to be given an explicit form.

In order to evaluate Eq. 2, we need to compute $p_t^{<}(\mathbf{x})$ and $p_t^{\geq}(\mathbf{x})$. These joint distributions depend on the graphical model of the hyperparameter space, which is allowed to form a tree structure.

We discuss how to compute $p_t^{<}(\mathbf{x})$ in the following. $p_t^{\geq}(\mathbf{x})$ is computed similarly, using trials where $y \geq y^*$. We associate each hyperparameter with a node in the graphical model; consider the $k$th dimension of $\mathbf{x}$, denoted by random variable $X^k$.

• If $X^k$ ranges over a discrete set $\mathcal{X}$, TPE uses a reweighted categorical distribution, where the probability that $X^k = x$ is proportional to a smoothing parameter plus the counts of occurrences of $X^k = x$ in $\mathbf{x}_{1:t}$ with $y_t < y^*$.

• When $X^k$ is continuous-valued, TPE constructs a probability distribution by placing a truncated Gaussian distribution centered at each of $x_t^k$ where $y_t < y^*$, with standard deviation set to the greater of the distances to the left and right neighbors.

In the simplest version, each node is independent, so we can compute $p_t^{<}(\mathbf{x})$ by multiplying individual probabilities at every node. In the tree-structured version, we only multiply probabilities along the relevant path, excluding some nodes.
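The discrete case and the independent-node product can be sketched as follows. This is a simplified illustration, not the TPE implementation used in the experiments: it covers categorical hyperparameters only, uses a simple index-based quantile for $y^*$, enumerates every candidate instead of sampling, and the trial accuracies are invented.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Sequence, Tuple

Config = Dict[str, str]

# Hypothetical two-node search space (values kept as strings for simplicity).
SPACE: Dict[str, List[str]] = {
    "weighting": ["tf", "tf-idf", "binary"],
    "remove_stopwords": ["True", "False"],
}


def split_trials(trials: Sequence[Tuple[Config, float]], gamma: float = 0.15):
    """Split past trials at y*, taken here as the gamma-quantile of scores."""
    ys = sorted(y for _, y in trials)
    y_star = ys[int(gamma * len(ys))]
    below = [x for x, y in trials if y < y_star]
    above = [x for x, y in trials if y >= y_star]
    return y_star, below, above


def node_density(values: Sequence[str], observed: Sequence[str],
                 smoothing: float = 1.0) -> Dict[str, float]:
    """Categorical density proportional to smoothing + observed counts."""
    counts = Counter(observed)
    total = len(observed) + smoothing * len(values)
    return {v: (counts[v] + smoothing) / total for v in values}


def joint_density(x: Config, observed: Sequence[Config]) -> float:
    """Independent-node version: multiply the per-node probabilities."""
    p = 1.0
    for name, values in SPACE.items():
        p *= node_density(values, [c[name] for c in observed])[x[name]]
    return p


def ratio(x: Config, below: Sequence[Config], above: Sequence[Config]) -> float:
    """p^<(x) / p^>=(x); a smaller ratio means a larger value of Eq. 2."""
    return joint_density(x, below) / joint_density(x, above)


# Invented development accuracies for eight past trials.
TRIALS: List[Tuple[Config, float]] = [
    ({"weighting": "tf", "remove_stopwords": "True"}, 0.70),
    ({"weighting": "tf", "remove_stopwords": "False"}, 0.74),
    ({"weighting": "tf-idf", "remove_stopwords": "True"}, 0.78),
    ({"weighting": "tf-idf", "remove_stopwords": "False"}, 0.80),
    ({"weighting": "binary", "remove_stopwords": "True"}, 0.82),
    ({"weighting": "binary", "remove_stopwords": "False"}, 0.85),
    ({"weighting": "binary", "remove_stopwords": "False"}, 0.84),
    ({"weighting": "tf", "remove_stopwords": "True"}, 0.71),
]

y_star, below, above = split_trials(TRIALS)
candidates = [dict(zip(SPACE, combo)) for combo in product(*SPACE.values())]
best = min(candidates, key=lambda x: ratio(x, below, above))
```

For a continuous hyperparameter, `node_density` would be replaced by a mixture of truncated Gaussians centered at the observed values, as described in the second bullet above.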


Another common approach to the surrogate is the Gaussian Process (Rasmussen and Williams, 2006; Hoffman et al., 2011; Snoek et al., 2012). Like Bergstra et al. (2011), our preliminary experiments found the TPE to perform favorably. Further, TPE's tree-structured configuration space is advantageous, because it allows nested definitions of hyperparameters, which we exploit in our experiments (e.g., only allowing bigrams to be chosen if unigrams are also chosen).

3.3 Implementation Details 

Because research on SMBO is active, many implementations are publicly available; we use the HPOlib library (Eggensperger et al., 2013).² The library takes as input a specification of hyperparameters and a function $L$, which is treated as a black box; in our case, $L$ is a logistic regression trainer that wraps the LIBLINEAR library (Fan et al., 2008), based on the trust region Newton method (Lin et al., 2008).

4 Experiments 

Our experiments consider representational choices 
and hyperparameters for several text categorization 
problems. 

4.1 Setup 

We fix our learner $L$ to logistic regression. We optimize text representation based on the types of n-grams used, the type of weighting scheme, and the removal of stopwords. For n-grams, we have two parameters, minimum and maximum lengths ($n_{min}$ and $n_{max}$). (All n-gram lengths between the minimum and maximum, inclusive, are used.) For weighting scheme, we consider term frequency, tf-idf, and binary schemes. Last, we also choose whether we should remove stopwords before constructing feature vectors for each document.
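A representation function with these four choices might look like the sketch below. Several details are assumptions made for illustration and are not specified in the text: whitespace tokenization, the idf variant log(N / df), and the sparse dictionary output format.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, List, Sequence


def ngrams(tokens: Sequence[str], n_min: int, n_max: int) -> List[str]:
    """All n-grams with n_min <= n <= n_max, inclusive."""
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def represent(docs: Sequence[str], n_min: int, n_max: int, weighting: str,
              stopwords: FrozenSet[str] = frozenset()) -> List[Dict[str, float]]:
    """Map downcased documents to sparse feature dictionaries."""
    counts = []
    for doc in docs:
        # Whitespace tokenization is an assumption of this sketch.
        tokens = [t for t in doc.lower().split() if t not in stopwords]
        counts.append(Counter(ngrams(tokens, n_min, n_max)))
    if weighting == "tf":
        return [dict(c) for c in counts]
    if weighting == "binary":
        return [{g: 1 for g in c} for c in counts]
    if weighting == "tf-idf":
        # One common idf variant: log(number of documents / document frequency).
        df = Counter(g for c in counts for g in c)
        n_docs = len(docs)
        return [{g: tf * math.log(n_docs / df[g]) for g, tf in c.items()}
                for c in counts]
    raise ValueError("unknown weighting scheme: " + weighting)


docs = ["The movie was good", "the movie was bad"]
binary_feats = represent(docs, n_min=1, n_max=2, weighting="binary")
tfidf_feats = represent(docs, n_min=1, n_max=1, weighting="tf-idf")
```

Each distinct setting of `n_min`, `n_max`, `weighting`, and `stopwords` yields a different feature space, so the learner has to be retrained for every setting that is tried.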

Furthermore, the choice of representation interacts with the regularizer and the training convergence criterion (e.g., more n-grams means slower training time). We consider two regularizers, $\ell_1$ penalty (Tibshirani, 1996) or squared $\ell_2$ penalty (Hoerl and Kennard, 1970). We also have hyperparameters for regularization strength and training convergence tolerance. See Table 1 for a complete list of hyperparameters in our experiments.
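The nested structure of this search space can be sketched as a sampling function. The numeric bounds for regularization strength and convergence tolerance below are placeholders chosen for illustration, and log-uniform sampling of the continuous values is likewise an assumption of this sketch.

```python
import math
import random
from typing import Dict, Tuple, Union

Value = Union[int, float, str, bool]


def count_discrete_combinations() -> int:
    """Number of settings of the discrete hyperparameters alone."""
    ngram_ranges = [(lo, hi) for lo in (1, 2, 3) for hi in range(lo, 4)]
    # n-gram ranges x weighting schemes x stopword removal x regularizer
    return len(ngram_ranges) * 3 * 2 * 2


def sample_configuration(
        rng: random.Random,
        strength_range: Tuple[float, float] = (1e-3, 1e3),    # placeholder bounds
        tolerance_range: Tuple[float, float] = (1e-5, 1e-3),  # placeholder bounds
) -> Dict[str, Value]:
    """Draw one configuration; n_max is nested under n_min."""

    def log_uniform(bounds: Tuple[float, float]) -> float:
        lo, hi = bounds
        return 10 ** rng.uniform(math.log10(lo), math.log10(hi))

    n_min = rng.choice([1, 2, 3])
    n_max = rng.choice(list(range(n_min, 4)))  # only values >= n_min are allowed
    return {
        "n_min": n_min,
        "n_max": n_max,
        "weighting": rng.choice(["tf", "tf-idf", "binary"]),
        "remove_stopwords": rng.choice([True, False]),
        "regularizer": rng.choice(["l1", "l2"]),
        "strength": log_uniform(strength_range),
        "tolerance": log_uniform(tolerance_range),
    }


rng = random.Random(0)
samples = [sample_configuration(rng) for _ in range(200)]
n_discrete = count_discrete_combinations()
```

Even before the two continuous hyperparameters are considered, the discrete choices alone give 72 combinations, more than twice the 30 trials used per dataset.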

Note that even with this limited number of options, the number of possible combinations is huge (it is actually infinite since the regularization strength and convergence tolerance are continuous values, although we can also use sets of possible values), so exhaustive search is computationally expensive. In all our experiments for all datasets, we limit ourselves to 30 trials per dataset. The only preprocessing we applied was downcasing (see §5 for discussion about this).

We always use a development set to evaluate $f(\mathbf{x})$ during learning and report the final result on an unseen test set.

²http://www.automl.org/hpolib.html

Hyperparameter            Values
n_min                     {1, 2, 3}
n_max                     {n_min, ..., 3}
weighting scheme          {tf, tf-idf, binary}
remove stop words?        {True, False}

regularization            {ℓ1, ℓ2}
regularization strength   [10^(-?), 10^(?)]
convergence tolerance     [10^(-?), 10^(-3)]

Table 1: The set of hyperparameters considered in our experiments. The top half are hyperparameters related to text representation, while the bottom half are logistic regression hyperparameters, which also interact with the chosen representation.

Dataset                Training   Dev.    Test
Stanford sentiment     6,920      872     1,821
Amazon electronics     20,000     5,000   25,000
IMDB reviews           20,000     5,000   25,000
Congress vote          1,175      113     411
20N all topics         9,052      2,262   7,532
20N all science        1,899      474     1,579
20N atheist.religion   686        171     570
20N x.graphics         942        235     784

Table 2: Document counts.


4.2 Datasets 


We evaluate our method on five text categorization tasks.


Stanford sentiment treebank (Socher et al., 2013): a sentence-level sentiment analysis dataset for movie reviews from the rottentomatoes.com website. We use the binary classification task where the goal is to predict whether a review is positive or negative (no neutral reviews). We obtained this dataset from http://nlp.stanford.edu/sentiment.

Electronics product reviews from Amazon (McAuley and Leskovec, 2013): this dataset consists of electronic product reviews, which is a subset of a large Amazon review dataset. Following the setup of Johnson and Zhang (2014), we only use the text section and ignore the summary section. We also only consider positive and negative reviews. We obtained this dataset from http://riejohnson.com/cnn_data.html.

IMDB movie reviews (Maas et al., 2011): a binary sentiment analysis dataset of highly polar IMDB movie reviews, obtained from http://ai.stanford.edu/~amaas/data/sentiment.

Congressional vote (Thomas et al., 2006): transcripts from the U.S. Congressional floor debates. The dataset only includes debates for controversial bills (the losing side has at least 20% of the speeches). Similar to previous work (Thomas et al., 2006; Yessenalina et al., 2010), we consider the task to predict the vote (“yea” or “nay”) for the speaker of each speech segment (speaker-based speech-segment classification). We obtained it from http://www.cs.cornell.edu/~ainur/sle-data.html.

20 Newsgroups (Lang, 1995): the 20 Newsgroups dataset is a benchmark topic classification dataset; we use the publicly available copy at http://qwone.com/~jason/20Newsgroups. There are 20 topics in this dataset. We derived four topic classification tasks from this dataset. The first task is to classify documents across all 20 topics. The second task is to classify related science documents into four science topics (sci.crypt, sci.electronics, sci.med, sci.space).³ The third and fourth tasks are talk.religion.misc vs. alt.atheism and comp.graphics vs. comp.windows.x. To consider a more realistic setting, we removed header information from each article since they often contain label information.


These are standard datasets for evaluating text categorization models, where benchmark results are available. In total, we have eight tasks, of which four are sentiment analysis tasks and four are topic classification tasks. See Table 2 for descriptive statistics of our datasets.

³We were not able to find previous results that are comparable to ours on the second task; we include them to enable further comparisons in the future.

Dataset                Acc.    n_min   n_max   Weighting   Stop.   Reg.   Strength   Conv.
Stanford sentiment     82.43   1       2       tf-idf      F       ℓ2     10         0.098
Amazon electronics     91.56   1       3       binary      F       ℓ2     120        0.022
IMDB reviews           90.85   1       2       binary      F       ℓ2     147        0.019
Congress vote          78.59   2       2       binary      F       ℓ2     121        0.012
20N all topics         87.84   1       2       binary      F       ℓ2     16         0.008
20N all science        95.82   1       2       binary      F       ℓ2     142        0.007
20N atheist.religion   86.32   1       2       binary      T       ℓ1     41         0.011
20N x.graphics         92.09   1       1       binary      T       ℓ2     91         0.014

Table 3: Classification accuracies and the best hyperparameters for each of the datasets in our experiments. “Acc.” shows accuracies for our logistic regression model. “n_min” and “n_max” correspond to the minimum and maximum n-gram lengths, respectively. “Stop.” is whether we perform stopword removal or not. “Reg.” is the regularization type, “Strength” is the regularization strength, and “Conv.” is the convergence tolerance. For regularization strength, we round it to the nearest integer for readability.


4.3 Baselines 

For each dataset, we select supervised, non-ensemble classification methods from previous literature as baselines. In each case, we emphasize comparisons with the best-published linear method (often an SVM with a linear kernel with representation selected by experts) and the best-published method overall. In the following, “SVM” always means “linear SVM”. All methods were trained and evaluated on the same training/testing data splits; in cases where standard development sets were not available, we used a random 20% of the training data as a development set.
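Holding out a random 20% of the training data can be sketched as below. The function name, the fixed seed, and the rounding of the development set size are choices made for this illustration rather than details given in the text.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_dev_split(examples: Sequence[T], dev_fraction: float = 0.2,
                    seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the examples and hold out a fraction as a dev set."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_dev = int(round(dev_fraction * len(shuffled)))
    return shuffled[n_dev:], shuffled[:n_dev]


train, dev = train_dev_split(list(range(100)))
```

The held-out portion is the data on which $f(\mathbf{x})$ is evaluated during the search; the test set is left untouched until the final configuration is chosen.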


4.4 Results 

We summarize the hyperparameters selected by our method, and the accuracies achieved (on test data) in Table 3. We discuss comparisons to baselines for each dataset in turn.


Stanford sentiment treebank (Table 4). Our logistic regression model outperforms the baseline SVM reported by Socher et al. (2013), who used only unigrams but did not specify the weighting scheme for their SVM baseline. While our result is still below the state-of-the-art based on the recursive neural tensor networks (Socher et al., 2013) and the paragraph vector (Le and Mikolov, 2014), we show that logistic regression is comparable with recursive and matrix-vector neural networks (Socher et al., 2011; Socher et al., 2012).


Method                             Acc.
Naive Bayes                        81.8
SVM                                79.4
Vector average                     80.1
Recursive neural networks          82.4
LR (this work)                     82.4
Matrix-vector RNN                  82.9
Recursive neural tensor networks   85.4
Paragraph vector                   87.8

Table 4: Comparisons on the Stanford sentiment treebank dataset. Scores are as reported by Socher et al. (2013) and Le and Mikolov (2014).

Amazon electronics (Table 5). The best-performing methods on this dataset are based on convolutional neural networks (Johnson and Zhang, 2014).⁴ Our method is on par with the second-best of these, outperforming all of the reported feed-forward neural networks and SVM variants Johnson and Zhang used as baselines. They varied the representations, and used log term frequency and normalization to unit vectors as the weighting scheme, after finding that this outperformed term frequency. Our method achieved the best performance with binary weighting, which they did not consider.

⁴These are fully connected neural networks with a rectifier activation function, trained under $\ell_2$ regularization with stochastic gradient descent.


Method                 Acc.
SVM-unigrams           88.62
SVM-{1, 2}-grams       90.70
SVM-{1, 2, 3}-grams    90.68
NN-unigrams            88.94
NN-{1, 2}-grams        91.10
NN-{1, 2, 3}-grams     91.24
LR (this work)         91.56
Bag of words CNN       91.58
Sequential CNN         92.22

Table 5: Comparisons on the Amazon electronics dataset. Scores are as reported by Johnson and Zhang (2014).


IMDB reviews (Table 6). The results parallel those for Amazon electronics; our method comes close to convolutional neural networks (Johnson and Zhang, 2014), which are state-of-the-art.⁵ It outperforms SVMs and feed-forward neural networks, the restricted Boltzmann machine approach presented by Dahl et al. (2012), and compressive feature learning (Paskov et al., 2013).⁶


Method                         Acc.
SVM-unigrams                   88.69
SVM-{1, 2}-grams               89.83
SVM-{1, 2, 3}-grams            89.62
RBM                            89.23
NN-unigrams                    88.95
NN-{1, 2}-grams                90.08
NN-{1, 2, 3}-grams             90.31
Compressive feature learning   90.40
LR-{1, 2, 3, 4, 5}-grams       90.60
LR (this work)                 90.85
Bag of words CNN               91.03
Sequential CNN                 91.26

Table 6: Comparisons on the IMDB reviews dataset. SVM results are from Wang and Manning (2012), the RBM (restricted Boltzmann machine) result is from Dahl et al. (2012), NN and CNN results are from Johnson and Zhang (2014), and LR-{1, 2, 3, 4, 5}-grams and compressive feature learning results are from Paskov et al. (2013).

Congressional vote (Table 7). Our method outperforms the best reported results of Yessenalina et al. (2010), which use a multi-level structured model based on a latent-variable SVM. We show comparisons to two well-known but weaker baselines, as well.

Method           Acc.
SVM-link         71.28
Min-cut          75.00
SVM-SLE          77.67
LR (this work)   78.59

Table 7: Comparisons on the U.S. congressional vote dataset. SVM-link exploits link structures (Thomas et al., 2006); the min-cut result is from Bansal et al. (2008); and the SVM-SLE result is reported by Yessenalina et al. (2010).

20 Newsgroups: all topics (Table 8). Our method outperforms state-of-the-art methods including the distributed structured output model (Srikumar and Manning, 2014).⁷ The strong logistic regression baseline from Paskov et al. (2013) uses all n-grams up to 5-grams, heuristic normalization, and elastic net regularization; our method found that unigrams and bigrams, with binary weighting and $\ell_2$ penalty, achieved far better results.

Method                         Acc.
Discriminative RBM             76.20
Compressive feature learning   83.00
LR-{1, 2, 3, 4, 5}-grams       82.80
Distributed structured output  84.00
LR (this work)                 87.84

Table 8: Comparisons on the 20 Newsgroups dataset for classifying documents into all topics. The discriminative RBM result is from Larochelle and Bengio (2008); compressive feature learning and LR-5-grams results are from Paskov et al. (2013), and the distributed structured output result is from Srikumar and Manning (2014).

20 Newsgroups: talk.religion.misc vs. alt.atheism and comp.graphics vs. comp.windows.x. Wang and Manning (2012) report a bigram naive Bayes model achieving 85.1% and 91.2% on these tasks, respectively.⁸ Our method achieves 86.3% and 92.1% using slightly different setups (see Table 3).


5 Discussion 


Raw text as input and other hyperparameters. Our results suggest that seemingly mundane representation choices can raise the performance of simple linear models to be comparable with much more sophisticated models. Achieving these results is not a matter of deep expertise about the domain or engineering skill; the choices can be automated. Our experiments only considered logistic regression with downcased text; more choices (stemming, count thresholding, normalization of numbers, etc.) can be offered to the optimizer, as can additional feature options like gappy n-grams.

As NLP becomes more widely used in applications, we believe that automating these choices will be very attractive for those who need to train a high-performance model quickly.


⁵As noted, semi-supervised and ensemble methods are excluded for a fair comparison.

⁶This approach is based on minimum description length, using unlabeled data to select a set of higher-order n-grams to use as features. It is technically a semi-supervised method. The results we compare to use logistic regression with elastic net regularization and heuristic normalizations.

⁷This method was designed for structured prediction, but Srikumar and Manning (2014) also applied it to classification. It attempts to learn a distributed representation for features and for labels. The authors used unigrams and did not elaborate the weighting scheme.

⁸They also report a naive Bayes/SVM ensemble achieving 87.9% and 91.2%.
Figure 1: Classification accuracies on development data for Amazon electronics (left), Stanford sentiment treebank (center), and congressional vote (right) datasets. In each plot, the green solid line indicates the best accuracy found so far, while the dotted orange line shows accuracy at each trial. We can see that in general the model is able to obtain a reasonably good representation in 30 trials.


Optimized representations. For each task, the chosen representation is different. Out of all possible hyperparameter choices in our experiments (Table 1), each of them is used by at least one of the datasets (Table 3). For example, on the Congressional Vote dataset, we only need to use bigrams, whereas on the Amazon electronics dataset we need to use unigrams, bigrams, and trigrams. The binary weighting scheme works well for most of the datasets, except the sentence-level sentiment analysis task, where the tf-idf weighting scheme was selected. $\ell_2$ regularization was best in all cases but one.

We do not believe that an NLP expert would be likely to make these particular choices, except through the same kind of trial-and-error process our method automates efficiently. Often, we believe, researchers in NLP make initial choices and stick with them through all experiments (as we have admittedly done with logistic regression). Optimizing over more of these choices will give stronger baselines.


Training time. We ran 30 trials for each dataset in our experiments. Figure 1 shows each trial accuracy and the best accuracy on development data as we increase the number of trials for three datasets. We can see that 30 trials are generally enough for the model to obtain good results, although the search space is large.

In the presence of unlimited computational resources, Bayesian optimization is slower than grid search on all hyperparameters, since the latter is easy to parallelize. This is not realistic in most research and development environments, and it is certainly impractical in increasingly widespread instances of personalized machine learning. The Bayesian optimization approach that we use in our experiments is performed sequentially. It attempts to predict what set of hyperparameters we should try next based on information from previous trials. There has been work to parallelize Bayesian optimization, making it possible to leverage the power of multicore architectures (Snoek et al., 2012; Desautels et al., 2012; Hutter et al., 2012).

Transfer learning and multitask setting. We treat each dataset independently and create a separate model for each of them. It is also possible to learn from previous datasets (i.e., transfer learning) or to learn from all datasets simultaneously (i.e., multitask learning) to improve performance. This has the potential to reduce the number of trials required even further. See Bardenet et al. (2013), Swersky et al. (2013), and Yogatama and Mann (2014) for how to perform Bayesian optimization in these settings.

Beyond linear models. We use logistic regression as our classification model, and our experiments show how simple linear models can be competitive with more sophisticated models given the right representation. Other models can be considered, of course, as can ensembles (Yogatama and Mann, 2014). Increasing the number of options may lead to a need for more trials, and evaluating $f(\mathbf{x})$ (e.g., training the neural network) will take longer for more sophisticated models. We have demonstrated, using one of the simplest classification models (logistic regression), that even simple choices about text representation can matter quite a lot.










































Structured prediction problems. Our framework could also be applied to structured prediction problems. For example, in part-of-speech tagging, the set of features can include character n-grams, word shape features, and word type features. The optimal choice for different languages is not always the same; our approach can automate this process.

Beyond supervised learning. Our framework could also be extended to unsupervised and semi-supervised models. For example, in document clustering (e.g., k-means), we also need to construct representations for documents. Log-likelihood might serve as a performance function. A range of random initializations might be considered. Investigation of this approach for nonconvex problems like clustering is an exciting area for future work.

6 Conclusion 

We used a Bayesian optimization approach to optimize choices about text representations for various categorization problems. Our sequential model-based optimization technique identifies settings for a standard linear model (logistic regression) that are competitive with far more sophisticated state-of-the-art methods on topic classification and sentiment analysis. Every task and dataset has its own optimal choices; though relatively uninteresting to researchers and not directly linked to domain or linguistic expertise, these choices have a big effect on performance. We see our approach as a first step towards black-box NLP systems that work with raw text and do not require manual tuning.